Why is this relevant?
In learning R today, we’ll take the approach used in R for Data Science (Wickham & Grolemund, 2017), diagrammed like this:
One of the strengths of R: every step of this workflow can be done using it!
install.packages('tidyverse')
A package only needs to be installed once (per major version, e.g. 3.3.x to 3.4.x), but must be loaded every time.
Data on the fuel efficiency of 38 models of cars between 1999 and 2008 from the US EPA:
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
Exercises with aes():
hwy vs cyl? class vs drv? What’s going on here? How might we make it more useful?
What about those outliers? Use the color aesthetic.
What’s happening here?
What’s happened here? What colour are the points? Why?
mpg variable types: which are categorical, which are continuous? (?mpg)
What about aes(color = displ < 5)?
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
?geom_smooth
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv,
linetype = drv))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
A common scenario:
## [1] 0.15
## [1] 44.66667
## [1] 1
Calling R functions, in general:
function_name(arg1 = val1, arg2 = val2, ...)
An example function, seq:
seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
dplyr and tidyr (tidyverse)
We’ll look at Statistics Canada CANSIM Table 0477-0058, which covers university revenues.
To save time, I’ve already downloaded the data from CANSIM. You can get it here:
We’ll use read_csv() from readr, which I
prefer to the base R function, as it has some more sensible defaults,
and it’s faster.
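As a quick self-contained illustration of those defaults, read_csv() can parse literal CSV text wrapped in I() (the toy column names here are illustrative, not from the CANSIM file):

```r
library(readr)

# read_csv() guesses column types and returns a tibble;
# here we parse a literal CSV string just to show the defaults.
small <- read_csv(I("name,value\nalpha,1\nbeta,2"))
small
# name is parsed as character, value as double, automatically.
```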
## Rows: 122240 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Ref_Date, GEO, SCHOOL, REVENUE, FUND, Vector, Coordinate
## dbl (1): Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The read_csv() function tells us that it guessed the
types of the various columns. In this situation, the default guesses are
fine, but of course we can force it to treat columns certain ways if we
wish.
## # A tibble: 122,240 × 8
## Ref_Date GEO SCHOOL REVENUE FUND Vector Coordinate Value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2000/2001 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.62e7
## 2 2001/2002 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.72e7
## 3 2002/2003 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.85e7
## 4 2003/2004 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.16e7
## 5 2004/2005 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.27e7
## 6 2005/2006 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.47e7
## 7 2006/2007 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.65e7
## 8 2007/2008 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.72e7
## 9 2008/2009 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.64e7
## 10 2009/2010 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 3.19e7
## # ℹ 122,230 more rows
We have 8 columns and 122,240 rows of data. read_csv()
brings the data in as a tibble, which is just an R “data
frame”, but with some handy defaults, some of which we’re seeing here.
For instance, it gives us the size of the data frame in rows and
columns, the types of the columns (e.g., “<chr>” for character)
and only prints the first 10 rows, instead of overwhelming us with all
of the data.
Or click the icon in the Environment tab.
(See Wickham, 2014 and R for Data Science, chapter 12)
Is the CANSIM data tidy?
Pick observations by their values
(filter()).
Reorder the rows
(arrange()).
Pick variables by their names
(select()).
Create new variables with functions of existing
variables (mutate()).
Collapse many values down to a single summary
(summarise()).
All can be used in conjunction with group_by().
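A self-contained sketch of these verbs together, using the built-in mtcars data (column names are from mtcars, not the CANSIM data):

```r
library(dplyr)

mtcars %>%
  filter(cyl != 8) %>%            # pick rows by their values
  select(mpg, cyl, wt) %>%        # pick columns by name
  mutate(wt_kg = wt * 453.6) %>%  # new variable (wt is in 1000s of lbs)
  group_by(cyl) %>%               # then summarise within groups
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))         # reorder the rows
```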
https://www.rstudio.com/resources/cheatsheets/
(or Help | Cheatsheets in RStudio!)
filter() the rows we want
## # A tibble: 5,632 × 8
## Ref_Date GEO SCHOOL REVENUE FUND Vector Coordinate Value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2000/2001 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.62e7
## 2 2001/2002 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.72e7
## 3 2002/2003 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.85e7
## 4 2003/2004 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.16e7
## 5 2004/2005 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.27e7
## 6 2005/2006 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.47e7
## 7 2006/2007 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.65e7
## 8 2007/2008 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.72e7
## 9 2008/2009 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.64e7
## 10 2009/2010 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 3.19e7
## # ℹ 5,622 more rows
The filter() help tells us that everything after the first
argument (i.e., the “…”) is: “Logical predicates defined in terms of
the variables in .data. Multiple conditions are combined with
&.”
Other operators we can use with filter():
Complete set of boolean operations (figure): x is the left-hand
circle, y is the right-hand circle, and the shaded regions
show which parts each operator selects.
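A tiny self-contained sketch of these operators inside filter() (toy data, not the CANSIM table):

```r
library(dplyr)

df <- tibble(x = 1:6, grp = c('a', 'a', 'b', 'b', 'c', 'c'))

filter(df, x > 2 & grp == 'b')    # AND: both conditions must hold
filter(df, x > 5 | grp == 'a')    # OR: either condition
filter(df, !(grp == 'c'))         # NOT
filter(df, grp %in% c('a', 'c'))  # value is in a set
```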
filter()
In other words, the previous command is equivalent to this:
## # A tibble: 5,632 × 8
## Ref_Date GEO SCHOOL REVENUE FUND Vector Coordinate Value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2000/2001 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.62e7
## 2 2001/2002 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.72e7
## 3 2002/2003 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 1.85e7
## 4 2003/2004 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.16e7
## 5 2004/2005 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.27e7
## 6 2005/2006 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.47e7
## 7 2006/2007 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.65e7
## 8 2007/2008 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.72e7
## 9 2008/2009 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 2.64e7
## 10 2009/2010 Canada Total universities a… Total … Tota… v8070… 1.1.1.1 3.19e7
## # ℹ 5,622 more rows
I prefer this notation, as it’s more explicit.
filter()
But, one more thing: we need to assign the return value of the
filter() function back to a variable:
revenue <- filter(revenue_raw,
SCHOOL == 'Total universities and colleges' &
FUND == 'Total funds (x 1,000)')
Shortcut for assignment operator: Alt-Hyphen in RStudio
A new variable for each step gets cumbersome, so dplyr
provides an operator, the pipe (%>%) that combines
operations:
revenue_long <- revenue_raw %>%
# only rows matching this
filter(SCHOOL == 'Total universities and colleges' &
FUND == 'Total funds (x 1,000)') %>%
# remove these columns
select(-SCHOOL, -FUND, -Vector, -Coordinate) %>%
# fix up the date column
mutate(Ref_Date = as.integer(stringr::str_sub(Ref_Date, 1, 4)))
x %>% f(y) turns into f(x, y), and
x %>% f(y) %>% g(z) turns into
g(f(x, y), z), etc.
Shortcut for pipe operator: Ctrl-Shift-M or Cmd-Shift-M in RStudio
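A minimal sketch of that rewriting, using base functions:

```r
library(magrittr)  # provides %>% (also loaded by dplyr / the tidyverse)

# x %>% f() is f(x):
c(1, 4, 9) %>% sqrt()            # same as sqrt(c(1, 4, 9))

# x %>% f() %>% g() is g(f(x)):
c(1, 4, 9) %>% sqrt() %>% sum()  # same as sum(sqrt(c(1, 4, 9)))
```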
## # A tibble: 1 × 4
## Ref_Date GEO REVENUE Value
## <int> <chr> <chr> <dbl>
## 1 2000 Canada Total revenues 16224715
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
matches("(.)\\1"): selects variables that match a regular expression.
num_range("x", 1:3): matches x1, x2 and x3.
all_of(vector): all of the columns in the specified character vector.
any_of(vector): any of the columns in the specified character vector.
tidyr
The two main tidyr functions:
pivot_longer(data, cols, names_to, values_to): “lengthens” the data by moving the column names into the names_to column and the values into the values_to column.
pivot_wider(data, names_from, values_from): “widens” the data by taking the names in names_from and making them into columns, with the values from values_from.
tidyr Cheatsheet
https://www.rstudio.com/resources/cheatsheets/ (Data Import Cheatsheet)
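A self-contained sketch of the two functions as inverses of each other (toy data; the column names mimic, but are not, the CANSIM ones):

```r
library(tidyr)

wide <- tibble::tibble(year = c(2000, 2001), tuition = c(10, 11), grants = c(5, 6))

# lengthen: column names become values of REVENUE, cell values go to Value
long <- pivot_longer(wide, cols = c(tuition, grants),
                     names_to = "REVENUE", values_to = "Value")
long  # 4 rows: one per year/revenue-type pair

# widen: back to one column per revenue type
pivot_wider(long, names_from = REVENUE, values_from = Value)
```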
pivot_wider() CANSIM
## # A tibble: 176 × 34
## Ref_Date GEO `Total revenues` Federal Social Sciences and Humanities Res…¹
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2000 Canada 16224715 1554234 104306
## 2 2001 Canada 17231250 1846138 109511
## 3 2002 Canada 18488133 2223239 133368
## 4 2003 Canada 21647925 2534964 178803
## 5 2004 Canada 22687437 2635758 203082
## 6 2005 Canada 24705347 2833294 221604
## 7 2006 Canada 26458623 2874651 222010
## 8 2007 Canada 27201804 3060336 220779
## 9 2008 Canada 26405228 3075585 236172
## 10 2009 Canada 31929458 3752740 250356
## # ℹ 166 more rows
## # ℹ abbreviated name: ¹`Social Sciences and Humanities Research Council`
## # ℹ 29 more variables: `Health Canada` <dbl>,
## # `Natural Sciences and Engineering Research Council` <dbl>,
## # `Canadian Institute of Health Research` <dbl>,
## # `Canada Foundation for Innovation` <dbl>, `Canada Research Chairs` <dbl>,
## # `Other federal` <dbl>, `Non-federal` <dbl>, Provincial <dbl>, …
revenue %>%
filter(GEO == 'Canada') %>%
# pass the result to ggplot() as the first argument
ggplot(aes(Ref_Date, `Total revenues`)) +
# now it switches to + to combine, which is ggplot's way
geom_line()
Not the prettiest, but it works!
revenue %>%
filter(GEO == 'Canada') %>%
ggplot(aes(Ref_Date, `Total revenues`)) +
geom_line() +
labs(title = 'Total University and Degree-Granting College Revenue',
x = 'Fiscal Year', y = 'Revenue ($ thousands)') +
scale_y_continuous(labels = scales::comma)
revenue %>%
filter(GEO %in% c('Canada', 'Alberta', 'Ontario')) %>%
ggplot(aes(Ref_Date, `Tuition and other fees` / `Total revenues`, color = GEO)) +
geom_line() +
labs(title = 'Tuition and fees as a share of total revenue') +
scale_y_continuous(labels = scales::percent)
dplyr “verbs”
Which province had the most tuition revenue in 2010?
revenue %>%
filter(GEO != 'Canada', Ref_Date == 2010) %>%
select(GEO, `Tuition and other fees`) %>%
arrange(desc(`Tuition and other fees`))
## # A tibble: 10 × 2
## GEO `Tuition and other fees`
## <chr> <dbl>
## 1 Ontario 3563011
## 2 British Columbia 1009476
## 3 Quebec 825943
## 4 Alberta 711973
## 5 Nova Scotia 305597
## 6 Manitoba 176165
## 7 Saskatchewan 173600
## 8 New Brunswick 146213
## 9 Newfoundland and Labrador 61537
## 10 Prince Edward Island 27878
revenue %>%
select(Ref_Date, GEO, `Tuition and other fees`, `Total revenues`) %>%
mutate(tuition_share = `Tuition and other fees` / `Total revenues`)
## # A tibble: 176 × 5
## Ref_Date GEO `Tuition and other fees` `Total revenues` tuition_share
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2000 Canada 3052960 16224715 0.188
## 2 2001 Canada 3319524 17231250 0.193
## 3 2002 Canada 3777453 18488133 0.204
## 4 2003 Canada 4358013 21647925 0.201
## 5 2004 Canada 4706894 22687437 0.207
## 6 2005 Canada 4943699 24705347 0.200
## 7 2006 Canada 5216088 26458623 0.197
## 8 2007 Canada 5454375 27201804 0.201
## 9 2008 Canada 5853354 26405228 0.222
## 10 2009 Canada 6482539 31929458 0.203
## # ℹ 166 more rows
Keeps the same number of rows.
Useful functions to use with mutate():
+, -, *, /, ^
%/% (integer division), %% (remainder)
log(), log2(), log10()
lead(), lag()
cumsum(), cumprod(), cummin(), cummax(), cummean()
<, <=, >, >=, !=, ==
min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
if_else(), dplyr::recode(), case_when()
revenue %>%
group_by(GEO) %>%
summarise(avg_endowment_revenue = mean(Endowment)) %>%
arrange(-avg_endowment_revenue)
## # A tibble: 11 × 2
## GEO avg_endowment_revenue
## <chr> <dbl>
## 1 Canada 526071.
## 2 Ontario 211475.
## 3 Quebec 87323.
## 4 Alberta 80069
## 5 British Columbia 75752.
## 6 Nova Scotia 31148.
## 7 Saskatchewan 21256.
## 8 New Brunswick 13778.
## 9 Newfoundland and Labrador 2878.
## 10 Manitoba 1646.
## 11 Prince Edward Island 747.
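A self-contained sketch of a few of the mutate() helper functions listed earlier (lag(), cumsum(), min_rank()), on toy data with illustrative column names:

```r
library(dplyr)

df <- tibble(year = 2000:2004, revenue = c(10, 12, 11, 15, 18))

df %>%
  mutate(
    change = revenue - lag(revenue),   # difference from the previous row
    running_total = cumsum(revenue),   # cumulative sum
    rank = min_rank(desc(revenue))     # 1 = largest value
  )
```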
Useful summary functions:
mean(x), median(x)
sd(x), IQR(x) (interquartile range), mad(x) (median absolute deviation)
min(x), max(x), quantile(x, 0.25)
first(x), nth(x, 2), last(x)
n(), n_distinct(x)
sum(x > 10), mean(y == 0): TRUE is converted to 1 and FALSE to 0.
Compute the total tuition revenue of the three western-most provinces in 2007.
revenue %>%
filter(GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan'), Ref_Date == 2007) %>%
summarise(sum(`Tuition and other fees`))
## # A tibble: 1 × 1
## `sum(\`Tuition and other fees\`)`
## <dbl>
## 1 1374509
These three provinces compared to all other provinces and national average.
revenue %>%
filter(Ref_Date == 2007) %>%
mutate(category = case_when(
GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan') ~ '3 Western Provinces',
GEO == 'Canada' ~ GEO,
TRUE ~ 'All Other Provinces'
)) %>%
group_by(category) %>%
summarise(sum(`Tuition and other fees`))
## # A tibble: 3 × 2
## category `sum(\`Tuition and other fees\`)`
## <chr> <dbl>
## 1 3 Western Provinces 1374509
## 2 All Other Provinces 4079866
## 3 Canada 5454375
tuition_relative_change <- revenue %>%
arrange(Ref_Date, GEO) %>%
group_by(GEO) %>%
mutate(rel_change = (`Tuition and other fees` - first(`Tuition and other fees`)) / first(`Tuition and other fees`)) %>%
select(Ref_Date, GEO, `Tuition and other fees`, rel_change)
ggplot(tuition_relative_change, aes(Ref_Date, rel_change, color = GEO)) +
geom_line()
Count years with positive endowment income by province
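The code for this count isn’t shown on the slide; a sketch of the pattern that would produce a summary like the one below (the column name num_pos_endow is taken from the output; the toy data here is a stand-in):

```r
library(dplyr)

# Toy stand-in with the same column names as the widened CANSIM data:
toy <- tibble(GEO = rep(c("A", "B"), each = 3),
              Endowment = c(5, 0, 2, -1, 0, 4))

toy %>%
  group_by(GEO) %>%
  summarise(num_pos_endow = sum(Endowment > 0))

# With the real data, the equivalent would be:
# revenue %>% group_by(GEO) %>% summarise(num_pos_endow = sum(Endowment > 0))
```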
## # A tibble: 11 × 2
## GEO num_pos_endow
## <chr> <int>
## 1 Alberta 13
## 2 British Columbia 13
## 3 Canada 13
## 4 Manitoba 14
## 5 New Brunswick 13
## 6 Newfoundland and Labrador 13
## 7 Nova Scotia 14
## 8 Ontario 12
## 9 Prince Edward Island 13
## 10 Quebec 16
## 11 Saskatchewan 12
purrr (quickly)
revenue %>%
filter(GEO != 'Canada') %>%
group_by(Ref_Date) %>%
summarise(revenue_quantiles = list(quantile(`Total revenues`, c(0.25, 0.5, 0.75)))) %>%
mutate(
low_25 = map_dbl(revenue_quantiles, "25%"),
mid = map_dbl(revenue_quantiles, '50%'),
high_75 = map_dbl(revenue_quantiles, '75%')
)
## # A tibble: 16 × 5
## Ref_Date revenue_quantiles low_25 mid high_75
## <int> <list> <dbl> <dbl> <dbl>
## 1 2000 <dbl [3]> 424965 681742. 1944551.
## 2 2001 <dbl [3]> 417664. 700424 2126550.
## 3 2002 <dbl [3]> 465112. 729434 2378510.
## 4 2003 <dbl [3]> 480404. 804636. 2708008.
## 5 2004 <dbl [3]> 530995. 853132. 2818174.
## 6 2005 <dbl [3]> 560023. 911819 3257599.
## 7 2006 <dbl [3]> 616412. 964536. 3320636.
## 8 2007 <dbl [3]> 592927. 944042. 3560364.
## 9 2008 <dbl [3]> 568461. 968932. 3404852.
## 10 2009 <dbl [3]> 741016. 1161066 4363439.
## 11 2010 <dbl [3]> 787082. 1213456 4436956.
## 12 2011 <dbl [3]> 775059. 1169906. 4177081
## 13 2012 <dbl [3]> 799797. 1240542 4296184.
## 14 2013 <dbl [3]> 791843. 1298845 4437751.
## 15 2014 <dbl [3]> 799297. 1316785 4623504.
## 16 2015 <dbl [3]> 770571. 1261936. 4451983.
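The list-column plus purrr::map_dbl() pattern above, as a self-contained sketch:

```r
library(purrr)

# A list of named numeric vectors, like the revenue_quantiles column above:
q <- list(quantile(1:10, c(0.25, 0.5, 0.75)),
          quantile(1:100, c(0.25, 0.5, 0.75)))

# map_dbl() with a character argument extracts that named element
# from each list item, returning a plain double vector:
map_dbl(q, "50%")
```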
## [1] NA
## [1] NA
## [1] NA
What about this?
NA / 2
## [1] NA
To test for missing values, use is.na():
is.na(NA)
## [1] TRUE
filter() only includes rows where the condition is
TRUE; it excludes both FALSE and
NA values. If you want to preserve missing values, ask for
them explicitly:
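The outputs below come from a small example like this one (df is a toy tibble, following R for Data Science):

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))

filter(df, x > 1)             # the NA row is dropped
filter(df, is.na(x) | x > 1)  # keep the missing values explicitly
```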
## # A tibble: 1 × 1
## x
## <dbl>
## 1 3
## # A tibble: 2 × 1
## x
## <dbl>
## 1 NA
## 2 3
## [1] 1 2 3
## [1] 1 2 3
## [1] 2 4 6 5 7 9 8 10 12
What about…
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 5 7 9 8 10 12 11
What happened?
## [1] 2
What do we expect here?
mean(c(1, NA, 3))
## [1] NA
Use na.rm = TRUE to ignore missing values:
mean(c(1, NA, 3), na.rm = TRUE)
## [1] 2
## x y z
## 1 2 4
## a b c
## 1 2 3
## [1] "three" "two" "five"
## [1] "two" "four"
# subset a named vector w/character vector
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
## xyz def
## 5 2
## [1] 10 3 5 8 1
## [1] 10 NA 8 NA
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : num 3
## ..$ : num 4
When we have the data the way we want it, we want to ask questions about it: to determine the relationships between parts of the data.
This can range from exploratory modeling (to better understand the data) to more formally trying to establish relationships between variables. The process is the same for both, but the mindset is different.
The modelr package has a simulated dataset called
sim1.
## # A tibble: 30 × 2
## x y
## <int> <dbl>
## 1 1 4.20
## 2 1 7.51
## 3 1 2.13
## 4 2 8.99
## 5 2 10.2
## 6 2 11.3
## 7 3 7.36
## 8 3 10.5
## 9 3 10.5
## 10 4 12.4
## # ℹ 20 more rows
sim1
geom_abline
geom_abline plots a linear equation (an intercept and a slope).
That’s just one model - how can we tell if it’s good or not?
Exercise: try different intercepts and slopes with geom_abline. Can you get something that looks like a better model?
We look at the difference between the predictions and the actual data.
We can find this for any model by creating an R function. We take the
intercept and add the slope multiplied by x.
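The function itself isn’t shown on the slide; a sketch matching the argument order used by measure_distance() below (model1(data, mod), with mod = c(intercept, slope)):

```r
# mod[1] is the intercept, mod[2] the slope
model1 <- function(data, mod) {
  mod[1] + data$x * mod[2]
}

# e.g. predictions for intercept 12, slope 0.5 (as in the output below):
# model1(sim1, c(12, 0.5))
```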
## [1] 12.5 12.5 12.5 13.0 13.0 13.0 13.5 13.5 13.5 14.0 14.0 14.0 14.5 14.5 14.5
## [16] 15.0 15.0 15.0 15.5 15.5 15.5 16.0 16.0 16.0 16.5 16.5 16.5 17.0 17.0 17.0
Once we have the model’s predictions, we can look at the difference between them and the actual data.
measure_distance <- function(data, mod) {
diff <- data$y - model1(data, mod)
diff
}
measure_distance(sim1, c(12, 0.5))
## [1] -8.3000870 -4.9893659 -10.3745272 -4.0111426 -2.7568946 -1.7031768
## [7] -6.1436353 -2.9946506 -2.9883992 -1.5654109 -2.1073988 0.2579641
## [13] 4.6300498 -2.7619788 1.5248539 -1.7260230 0.9559750 1.8947962
## [19] 4.5859927 1.6718503 4.4363088 5.7259025 2.3909129 6.4755526
## [25] 10.2770099 6.3051098 4.6283053 7.9680994 6.3464221 4.9752007
We want both to make all distances positive and to penalize predictions that are far from the actual values more heavily, so we square the difference.
measure_distance <- function(data, mod) {
diff <- data$y - model1(data, mod)
diff ^ 2
}
measure_distance(sim1, c(12, 0.5))
## [1] 68.89144476 24.89377199 107.63081509 16.08926474 7.60046760
## [6] 2.90081117 37.74425497 8.96793224 8.93052986 2.45051128
## [11] 4.44112956 0.06654547 21.43736106 7.62852691 2.32517942
## [16] 2.97915534 0.91388814 3.59025256 21.03132891 2.79508358
## [21] 19.68083613 32.78595957 5.71646454 41.93278203 105.61693163
## [26] 39.75440948 21.42120989 63.49060769 40.27707338 24.75262198
To summarize across all the predictions, we take the mean and then the square root, to put the differences back in terms of the original units (the root-mean-squared deviation).
measure_distance <- function(data, mod) {
diff <- data$y - model1(data, mod)
sqrt(mean(diff ^ 2))
}
measure_distance(sim1, c(12, 0.5))
## [1] 4.995789
So, to choose between models, we can just put in different values.
## [1] 2.665212
Use this function to try a few different models.
Instead of guessing ourselves, we have the computer generate a lot of models and pick the one with the lowest error.
models <- tibble(
a1 = runif(250, -20, 40),
a2 = runif(250, -5, 5)
)
sim1_dist <- function(a1, a2) {
measure_distance(sim1, c(a1, a2))
}
models <- models %>%
mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
## # A tibble: 250 × 3
## a1 a2 dist
## <dbl> <dbl> <dbl>
## 1 27.0 -1.44 10.9
## 2 -0.0700 -1.48 25.9
## 3 0.990 -1.29 23.8
## 4 9.43 1.96 5.16
## 5 39.2 -0.219 23.5
## 6 -16.0 2.79 16.4
## 7 39.4 -4.70 19.6
## 8 19.7 -3.76 23.5
## 9 37.5 4.57 47.8
## 10 21.9 -4.48 26.2
## # ℹ 240 more rows
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
geom_point()
## # A tibble: 10 × 3
## a1 a2 dist
## <dbl> <dbl> <dbl>
## 1 4.00 1.95 2.28
## 2 4.90 1.79 2.39
## 3 1.51 2.45 2.47
## 4 3.90 1.79 2.84
## 5 6.23 2.15 3.31
## 6 3.67 1.65 3.67
## 7 1.72 1.92 3.91
## 8 8.52 0.856 4.64
## 9 8.28 2.06 4.64
## 10 -2.38 3.44 4.65
Plot those models on the data. Lighter colored models are closer.
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, color = "grey30") +
geom_abline(aes(intercept = a1, slope = a2, color = -dist),
data = filter(models, rank(dist) <= 10))
We plot every model, with distance as color. We highlight our 10 best models in red.
ggplot(models, aes(a1, a2)) +
geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist))
We are interested in the area where all the best random models are. It is unlikely that any one of our random models is the best possible model, so we can zoom in on the region where the best models tend to be. We create a regular grid of model parameters.
Note the axes.
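The grid itself isn’t shown on the slide; a sketch of a regular grid whose ranges match the a1/a2 values in the output below (following the R for Data Science approach; the dist column is then added with sim1_dist() as before):

```r
grid <- expand.grid(
  a1 = seq(-5, 20, length.out = 25),  # candidate intercepts
  a2 = seq(1, 3, length.out = 25)     # candidate slopes
)
nrow(grid)  # 625 parameter combinations

# then, as before:
# grid <- grid %>% mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
```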
grid %>%
ggplot(aes(a1, a2)) +
geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist))What do we get as the new best 10 models?
## a1 a2 dist
## 1 4.375000 2.000000 2.137234
## 2 4.375000 2.083333 2.155409
## 3 3.333333 2.166667 2.168677
## 4 5.416667 1.916667 2.210294
## 5 3.333333 2.250000 2.212637
## 6 5.416667 1.833333 2.218550
## 7 4.375000 1.916667 2.241533
## 8 3.333333 2.083333 2.246169
## 9 4.375000 2.166667 2.293149
## 10 2.291667 2.333333 2.308274
The contour lines represent equal “bins” of distance.
grid %>%
ggplot(aes(a1, a2)) +
geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist)) +
geom_contour(aes(z = -dist), colour = "orange")
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a1, slope = a2, colour = -dist),
data = filter(grid, rank(dist) <= 10)
)
The optim function chooses the parameters that minimize
distance. The c(0,0) is the starting point.
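With our argument order (measure_distance(data, mod)), the call needs a small wrapper, since optim() passes the parameter vector as its function’s first argument; the slide’s output below comes from the commented call on sim1. A self-contained toy version (names toy and rmse are illustrative):

```r
# With the functions above, this would be:
# best <- optim(c(0, 0), function(par) measure_distance(sim1, par))
# best$par

# Self-contained toy version: y = 4 + 2x exactly, so optim()
# should land very close to c(4, 2).
toy <- data.frame(x = 1:10)
toy$y <- 4 + 2 * toy$x
rmse <- function(par) sqrt(mean((toy$y - (par[1] + par[2] * toy$x))^2))
best <- optim(c(0, 0), rmse)
best$par  # approximately 4 and 2
```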
## [1] 4.222248 2.051204
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(intercept = best$par[1], slope = best$par[2])
Of course, R has a tool to do this whole process automatically: lm(). This is our standard OLS regression.
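The call producing the coefficients below (the model object name sim1_mod is the one used with the predictions later; sim1 comes from the modelr package):

```r
library(modelr)  # for the sim1 data

# Fit y as a linear function of x by ordinary least squares:
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```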
## (Intercept) x
## 4.220822 2.051533
We can use a model object like sim1_mod to generate
predictions. We can start with an evenly spaced grid of x values.
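The two outputs below come from modelr’s helpers: data_grid() builds the grid of unique x values, and add_predictions() adds a pred column from the model (a sketch, assuming sim1_mod from the lm() fit):

```r
library(modelr)
library(dplyr)

sim1_mod <- lm(y ~ x, data = sim1)

grid <- sim1 %>% data_grid(x)               # one row per unique x value
grid <- grid %>% add_predictions(sim1_mod)  # adds a pred column
grid
```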
## # A tibble: 10 × 1
## x
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## # A tibble: 10 × 2
## x pred
## <int> <dbl>
## 1 1 6.27
## 2 2 8.32
## 3 3 10.4
## 4 4 12.4
## 5 5 14.5
## 6 6 16.5
## 7 7 18.6
## 8 8 20.6
## 9 9 22.7
## 10 10 24.7
Plot the predictions with geom_line.
ggplot(sim1, aes(x)) +
geom_point(aes(y = y)) +
geom_line(aes(y = pred), data = grid, colour = "red", linewidth = 1)
Add residuals. We need to use the original dataset.
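The resid column below comes from modelr::add_residuals(), applied to the original data rather than the grid, since residuals need the observed y values (a sketch, assuming sim1_mod as above):

```r
library(modelr)
library(dplyr)

sim1_mod <- lm(y ~ x, data = sim1)
sim1 %>% add_residuals(sim1_mod)  # adds resid = y minus the model's prediction
```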
## # A tibble: 30 × 3
## x y resid
## <int> <dbl> <dbl>
## 1 1 4.20 -2.07
## 2 1 7.51 1.24
## 3 1 2.13 -4.15
## 4 2 8.99 0.665
## 5 2 10.2 1.92
## 6 2 11.3 2.97
## 7 3 7.36 -3.02
## 8 3 10.5 0.130
## 9 3 10.5 0.136
## 10 4 12.4 0.00763
## # ℹ 20 more rows
(See also the broom package for tidying model output.)
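The output below is t.test() on the built-in mtcars data (the formula matches the “disp by vs” line in the output):

```r
# Welch two-sample t-test: does mean displacement differ by engine shape (vs)?
t.test(disp ~ vs, data = mtcars)
```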
## Welch Two Sample t-test
##
## data: disp by vs
## t = 5.9416, df = 26.977, p-value = 2.477e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 114.3628 235.0229
## sample estimates:
## mean in group 0 mean in group 1
## 307.1500 132.4571
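Likewise, the correlation test below is cor.test() on the two mtcars columns named in its output:

```r
# Pearson correlation between displacement and fuel economy:
cor.test(mtcars$disp, mtcars$mpg)
```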
##
## Pearson's product-moment correlation
##
## data: mtcars$disp and mtcars$mpg
## t = -8.7472, df = 30, p-value = 9.38e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9233594 -0.7081376
## sample estimates:
## cor
## -0.8475514
Using survey data on Smoking and Exercise
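A sketch of the table and test below, assuming the survey data from the MASS package (the Smoke and Exer columns match the row/column labels in the output):

```r
library(MASS)  # for the `survey` data

tbl <- table(survey$Smoke, survey$Exer)  # smoking status vs exercise frequency
tbl
chisq.test(tbl)  # are smoking and exercise habits independent?
```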
##
## Freq None Some
## Heavy 7 1 3
## Never 87 18 84
## Occas 12 3 4
## Regul 9 1 7
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 5.4885, df = 6, p-value = 0.4828
Start here:
Additional:
For help/community: